Computing semantic relatedness of words and texts in Wikipedia-derived semantic space
Abstract
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was either based on purely statistical techniques that did not make use of background knowledge or on huge manual efforts, such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We use machine learning techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text. Compared with the previous state of the art, using ESA results in substantial improvements in the correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Consequently, we anticipate that ESA will give rise to the next generation of natural language processing tools. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
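The idea of representing a text as a weighted vector over Wikipedia-derived concepts, and scoring relatedness by comparing two such vectors, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three toy "concepts" stand in for Wikipedia articles, TF-IDF is assumed as the weighting scheme, cosine similarity is assumed as the comparison, and the names `CONCEPTS`, `build_index`, `interpret`, and `relatedness` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy concepts: title -> article text (stand-ins for Wikipedia articles).
CONCEPTS = {
    "Jaguar (animal)": "jaguar big cat predator jungle animal",
    "Jaguar Cars": "jaguar car engine luxury vehicle",
    "Automobile": "car engine wheel road vehicle driver",
}


def build_index(concepts):
    """Map each word to {concept: weight}. TF-IDF weighting is an assumption."""
    n = len(concepts)
    tf = {c: Counter(text.lower().split()) for c, text in concepts.items()}
    # Document frequency: in how many concepts each word occurs.
    df = Counter(word for counts in tf.values() for word in counts)
    index = defaultdict(dict)
    for concept, counts in tf.items():
        for word, count in counts.items():
            index[word][concept] = count * math.log(n / df[word])
    return index


def interpret(text, index):
    """Represent a text as the sum of the concept vectors of its words."""
    vec = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)


def relatedness(a, b, index):
    """Cosine similarity of two concept vectors (an assumed comparison measure)."""
    va, vb = interpret(a, index), interpret(b, index)
    dot = sum(x * vb.get(concept, 0.0) for concept, x in va.items())
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no known words, or no concept evidence
    return dot / (norm_a * norm_b)


INDEX = build_index(CONCEPTS)
print(interpret("jaguar", INDEX))           # weights on both "Jaguar" concepts
print(relatedness("cat", "wheel", INDEX))   # the two words share no concept
```

Because each dimension of the vector is a named concept, the output of `interpret` can be read directly, which reflects the explainability point made in the abstract. A real system would build the index from the full set of Wikipedia articles rather than three hand-written strings.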
Similar resources
Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis
Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weight...
Wikipedia-Based Semantic Interpreter Using Approximate Top-k Processing and Its Application
Proper representation of the meaning of texts is crucial for enhancing many data mining and information retrieval tasks, including clustering, computing semantic relatedness between texts, and searching. Representing texts in the concept space derived from Wikipedia has received growing attention recently. This concept-based representation is capable of extracting semantic relatedness betwee...
Wikipedia-based Compact Hierarchical Semantics with Application to Semantic Relatedness
A proper semantic representation of words and texts underlies many text processing tasks. In this paper, we present a novel representation of semantics which is based on a hierarchical ontology of natural concepts derived from Wikipedia articles and category system. Our method, called Compact Hierarchical Explicit Semantic Analysis (CHESA), generates compact hierarchical representations of unre...
WikiWalk: Random walks on Wikipedia for Semantic Relatedness
Computing semantic relatedness of natural language texts is a key component of tasks such as information retrieval and summarization, and often depends on knowledge from a broad range of real-world concepts and relationships. We address this knowledge integration issue with a method of computing semantic relatedness using personalized PageRank (random walks) on a graph derived from Wikipedia. T...
Computing Semantic Similarity of Documents Based on Semantic Tensors
Exploiting the semantic content of texts has always been an important and challenging issue in Natural Language Processing because of its wide range of applications, such as finding documents related to a query, document classification, and computing the semantic similarity of documents. In this paper, using the Wikipedia corpus and organizing it in a three-dimensional tensor structure, a novel corpus-based approa...
Publication date: 2006